Abstract

Background: Simple clustering methods such as hierarchical clustering and k-means
are widely used for gene expression data analysis, but they are unable to deal with
the noise and high dimensionality associated with microarray gene expression data.
Consensus clustering appears to improve the robustness and quality of clustering
results. Incorporating prior knowledge in the clustering process (semi-supervised
clustering) has been shown to improve the consistency between the data partitioning
and domain knowledge.

Methods: We proposed semi-supervised consensus clustering (SSCC) to integrate
consensus clustering with semi-supervised clustering for analyzing gene expression
data. We investigated the roles of consensus clustering and prior knowledge in
improving the quality of clustering. SSCC was compared with one semi-supervised
clustering algorithm, one consensus clustering algorithm, and k-means. Experiments
on eight gene expression datasets were performed using h-fold cross-validation.

Results: Using prior knowledge improved the clustering quality by reducing the 
impact of noise and high dimensionality in microarray data. Integration of consensus 
clustering with semi-supervised clustering improved performance as compared to 
using consensus clustering or semi-supervised clustering separately. Our SSCC method 
outperformed the others tested in this paper. 

Keywords: Semi-supervised clustering, Consensus clustering, Semi-supervised
consensus clustering, Gene expression

Background

Simple clustering methods such as agglomerative hierarchical clustering and k-means
have been widely used for gene expression data analysis. However, individual clustering
algorithms have their limitations in dealing with different datasets. For example, k-means
is unable to capture clusters with complex structures, and selecting the value of k without
subjectivity is challenging. Therefore, many studies used consensus clustering (also
called cluster ensemble) to improve the robustness and quality of clustering results [1-4].

Consensus clustering solves a clustering problem in two steps. The first step, known
as base clustering, takes a dataset as input and outputs an ensemble of clustering
solutions. The second step, known as final clustering, takes the cluster ensemble as
input, combines the solutions through a consensus function, and produces the final
partitioning as output. Consensus clustering algorithms differ in the algorithms chosen
for base clustering, consensus function and final clustering. Monti et al. used
hierarchical clustering (HC) or self-organizing map (SOM) as the base clustering to
generate a consensus matrix and either HC or SOM for final clustering [1]. Yu et al. used
k-means as the base clustering on subspace datasets and graph-cut algorithms for the final
clustering [2]. Kim used k-means as the base algorithm with a random number of
clusters and applied a graph-cut algorithm for final clustering [3]. The base clustering
generates diverse clustering solutions through: 1) generating subspace datasets using gene
resampling [1,2,4]; 2) using a single clustering algorithm with random parameter
initializations, such as selecting a random number of clusters [3,4]; 3) using different
clustering algorithms for each base clustering [5]. Some consensus clustering methods used
a pairwise similarity matrix of instances to combine multiple clustering solutions [1,2];
others used associations between instances and clusters in the consensus matrix [4]. These
consensus clustering algorithms usually outperform single clustering algorithms on gene
expression datasets [1-4].

Consensus clustering has been used for clustering samples to discover and classify
cancer types in cancer microarray data [1-4,6]. It has succeeded in capturing informative
patterns from microarray data [1-3]. A well-known consensus clustering algorithm, the
link-based cluster ensemble (LCE), was introduced in [4]. LCE outperforms the 10 algorithms
tested in [4], specifically, four simple clustering algorithms, three pairwise similarity based
consensus clustering algorithms, and three graph-based cluster ensemble techniques.
Consensus clustering is also used for clustering genes to identify biologically informative
gene clusters [5].

Many studies used prior knowledge in clustering genes [7-13]. These methods are
referred to as semi-supervised clustering approaches. The results showed that using a small
amount of prior knowledge was able to significantly improve the clustering results, and
that the more specific the prior knowledge, the greater the improvement in clustering quality.

Consensus clustering itself can be considered unsupervised and improves the
robustness and quality of results. Semi-supervised clustering is partially supervised and
improves the quality of results in a domain knowledge directed fashion. Although there
are many consensus clustering and semi-supervised clustering approaches, very few of
them used prior knowledge in the consensus clustering. Yu et al. used prior knowledge
in assessing the quality of each clustering solution and combining them in a consensus
matrix [14]. In this paper, we propose to integrate semi-supervised clustering and
consensus clustering, design a new semi-supervised consensus clustering algorithm, and
compare it with consensus clustering and semi-supervised clustering algorithms,
respectively. In our study, we evaluate the performance of semi-supervised consensus
clustering, consensus clustering, semi-supervised clustering and single clustering
algorithms using h-fold cross-validation. Prior knowledge is used on h - 1 folds, but not
in the testing data. We compare the performance of semi-supervised consensus clustering
with the other clustering methods.

Method 

Our semi-supervised consensus clustering algorithm (SSCC) includes a base clustering,
a consensus function, and a final clustering. Within the consensus clustering framework
of SSCC, we use semi-supervised spectral clustering (SSC) as the base clustering, hybrid
bipartite graph formulation (HBGF) as the consensus function, and spectral clustering (SC)
as the final clustering.

Spectral clustering 

The general idea of SC contains two steps: spectral representation and clustering. In
spectral representation, each data point is associated with a vertex in a weighted graph.
The clustering step is to find partitions in the graph. Given a dataset
$X = \{x_i \mid i = 1, \ldots, n\}$ and similarity $s_{ij} \geq 0$ between data points $x_i$
and $x_j$, the clustering process first constructs a similarity graph $G = (V, E)$,
$V = \{v_i\}$, $E = \{e_{ij}\}$, to represent the relationships among the data points, where
each node $v_i$ represents a data point $x_i$, and each edge $e_{ij}$ represents the
connection between two nodes $v_i$ and $v_j$ if their similarity $s_{ij}$ satisfies a given
condition. The edge between nodes is weighted by $s_{ij}$. The clustering process becomes a
graph cutting problem such that the edges within a group have high weights and those
between different groups have low weights. The weighted similarity graph can be a fully
connected graph or a t-nearest neighbor graph. In a fully connected graph, the Gaussian
similarity function is usually used as the similarity function,
$s_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$, where the parameter $\sigma$ controls the
width of the neighbourhoods. In a t-nearest neighbor graph, $x_i$ and $x_j$ are connected
with an undirected edge if $x_i$ is among the t nearest neighbors of $x_j$ or vice versa.
We used the t-nearest neighbor graph for spectral representation of gene expression
data.
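As an illustration of the spectral representation step, the following minimal Python sketch builds a symmetric t-nearest neighbor graph weighted by the Gaussian similarity. It is not the Matlab implementation used in this study; the function names, the toy points and the value of sigma are assumptions chosen only for the example.

```python
import math


def gaussian_similarity(x, y, sigma):
    """s_ij = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def knn_similarity_graph(points, t, sigma):
    """Symmetric t-nearest-neighbor similarity matrix as a list of lists.

    x_i and x_j are joined if x_i is among the t nearest neighbors of x_j
    or vice versa; the edge weight is the Gaussian similarity.
    """
    n = len(points)
    full = [[gaussian_similarity(points[i], points[j], sigma) for j in range(n)]
            for i in range(n)]
    graph = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Highest similarity corresponds to smallest distance.
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: full[i][j], reverse=True)[:t]
        for j in neighbors:
            graph[i][j] = full[i][j]
            graph[j][i] = full[i][j]  # "or vice versa": keep the graph undirected
    return graph


# Two well-separated groups of 1-D points (hypothetical toy data).
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
S = knn_similarity_graph(points, t=2, sigma=1.0)
```

With t = 2, no edge joins the two groups in this toy example, so the graph already separates them before any eigenvector computation.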

Semi-supervised spectral clustering 

SSC uses prior knowledge in spectral clustering. It uses pairwise constraints from the
domain knowledge. Pairwise constraints between two data points can be represented as
must-links (in the same class) and cannot-links (in different classes). For each must-link
pair $(i, j)$, assign $s_{ij} = s_{ji} = 1$. For each cannot-link pair $(i, j)$, assign
$s_{ij} = s_{ji} = 0$.

If we use SSC for clustering samples in gene expression data using the t-nearest neighbor
graph representation, two samples with highly similar expression profiles are connected
in the graph. Using a cannot-link sets the similarity between a pair of samples to 0,
which breaks the edge between that pair of samples in the graph. Therefore,
only must-links are applied in our study. The details of the SSC algorithm are described in
Algorithm 1. Given the data points $x_1, \ldots, x_n$, $l$ pairwise must-link constraints
are generated. The similarity matrix $S$ is obtained using the similarity function
$s_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$, where $\sigma$ is the scaling parameter that
determines when two points are considered similar; it was calculated according to [15].
Then $S$ is made sparse by keeping only the $t$ nearest neighbors of each data point.
Next, the $l$ pairwise constraints are applied to $S$. Steps 5-10 follow the normalized
spectral clustering algorithm [16,17].
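The constraint and Laplacian steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the helper names and the toy matrix are hypothetical, and the remaining steps (eigenvectors, row normalization, k-means) are left to a linear algebra library.

```python
import math


def apply_must_links(S, must_links):
    """Return a copy of S with s_ij = s_ji = 1 for every must-link (i, j)."""
    S = [row[:] for row in S]
    for i, j in must_links:
        S[i][j] = 1.0
        S[j][i] = 1.0
    return S


def normalized_laplacian(S):
    """L = I - D^{-1/2} S D^{-1/2}, with degrees d_i = sum_j s_ij."""
    n = len(S)
    degrees = [sum(row) for row in S]
    # An isolated vertex (degree 0) gets a zero factor to avoid division by zero.
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * S[i][j] * inv_sqrt[j]
             for j in range(n)]
            for i in range(n)]


# Toy sparse similarity matrix: samples 0-1 and 2-3 are connected, 1-2 are not.
S = [[0.0, 0.8, 0.0, 0.0],
     [0.8, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.6],
     [0.0, 0.0, 0.6, 0.0]]
S_constrained = apply_must_links(S, [(1, 2)])
L = normalized_laplacian(S_constrained)
```

A must-link adds an edge of maximal weight between two samples, which is why it can reconnect parts of a sparse t-nearest neighbor graph, whereas a cannot-link could only remove an edge.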

Consensus function 

We used the LCE ensemble framework in our SSCC, adopting HBGF as the consensus function.
The cluster ensemble is represented as a graph that consists of vertices and weighted
edges. HBGF models both instances and clusters of the ensemble simultaneously as
vertices in the graph. This approach retains all information provided by a given ensemble,
allowing the similarities among instances and among clusters to be considered collectively
in forming the final clustering [18]. More details about LCE can be found in [4].

Algorithm 1: Semi-supervised spectral clustering (SSC)

Input: Given n data points $x_1, \ldots, x_n$, the number of clusters k, and the number
of pairwise constraints l.

Output: Group $x_1, \ldots, x_n$ into k clusters.

1. Generate l must-link constraints from $x_1, \ldots, x_n$.

2. Construct a similarity matrix $S$, where $s_{ij} \geq 0$ represents the similarity between
$x_i$ and $x_j$.

3. Modify $S$ to be a sparse matrix using the t-nearest neighbor graph.

4. Apply the l pairwise constraints on $S$: $s_{ij} = s_{ji} = 1$.

5. Compute the normalized Laplacian matrix $L = I - D^{-1/2} S D^{-1/2}$. The degree
matrix $D$ is defined as the diagonal matrix with the degrees $d_1, \ldots, d_n$ on the
diagonal, $d_i = \sum_{j=1}^{n} s_{ij}$.

6. Compute the first k eigenvectors $u_1, \ldots, u_k$ of $L$.

7. Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, \ldots, u_k$ as columns.

8. Form the matrix $T \in \mathbb{R}^{n \times k}$ from $U$ by normalizing the rows to norm 1.

9. For $i = 1, \ldots, n$, let $y_i \in \mathbb{R}^{k}$ be the vector corresponding to the i-th row of $T$.

10. Cluster the points $(y_i)_{i=1,\ldots,n}$ with the k-means algorithm into k clusters.

Semi-supervised consensus clustering

To turn a consensus clustering algorithm into a semi-supervised consensus clustering
algorithm, prior knowledge can be applied in the base clustering, the consensus function,
or the final clustering. Final clustering is usually applied to the consensus matrix
generated from base clustering. SSCC uses the semi-supervised clustering algorithm SSC for
base clustering and does not use prior knowledge in either the consensus function or the
final clustering. Our experiment was performed using h-fold cross-validation. The dataset
was split into training and testing sets, and the prior knowledge was added to the h - 1
training folds. After the final clustering result was obtained, it was evaluated on the
testing set alone. In this way, the influence of prior knowledge could be assessed in a
cross-validation framework.

Our semi-supervised consensus clustering algorithm is described in Algorithm 2. Similar
to [4], for a given $n \times d$ dataset of n samples and d genes, an $n \times q$ data
subspace ($q < d$) is generated by

$$q = q_{min} + \lfloor \alpha (q_{max} - q_{min}) \rfloor \qquad (1)$$

where $\alpha \in [0, 1]$ is a uniform random variable, and $q_{min}$ and $q_{max}$ are the
lower and upper bounds of the subspace size. $q_{min}$ and $q_{max}$ are set to 0.75d and
0.85d. Let $\Pi = \{\pi_1, \ldots, \pi_m\}$ be a cluster ensemble with m clustering
solutions. SSC is applied on each subspace dataset to obtain clustering results. We use the
fixed number of clusters k, so each $\pi_i = \{C_1^i, \ldots, C_k^i\}$ is one clustering
solution. A basic cluster-association matrix BM is first generated based on the crisp
associations between samples and clusters using HBGF, in which there are n samples and
$m \times k$ clusters. If $x_i$ belongs to a cluster $c_j$, $BM(x_i, c_j) = 1$,
$i = 1, \ldots, n$, $j = 1, \ldots, m \times k$; otherwise $BM(x_i, c_j) = 0$. Next, a
refined cluster-association matrix RM is generated from BM by estimating new association
values $RM(x_i, c_j)$ if $BM(x_i, c_j) = 0$. $RM(x_i, c_j)$ is the similarity between $c_j$
and other clusters to which $x_i$ probably belongs. The similarity of any clusters in the
cluster ensemble is obtained from a weighted graph of clusters. Finally, spectral clustering
is applied on RM to obtain the final clustering solution.
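The subspace generation and the binary cluster-association matrix described above can be sketched in Python as follows. This is a hedged illustration only: the function names and the column layout of BM are assumptions, the default fractions 0.75 and 0.85 reflect our reading of the bounds stated in the text, and the refinement of BM into RM (the link-based similarity of LCE [4]) is not reproduced here.

```python
import math
import random


def subspace_size(d, rng, q_min_frac=0.75, q_max_frac=0.85):
    """Equation (1): q = q_min + floor(alpha * (q_max - q_min)), alpha ~ U[0, 1]."""
    q_min = int(q_min_frac * d)
    q_max = int(q_max_frac * d)
    alpha = rng.random()
    return q_min + math.floor(alpha * (q_max - q_min))


def random_subspace(data, rng):
    """Keep q randomly chosen gene columns of an n x d dataset (list of rows)."""
    d = len(data[0])
    q = subspace_size(d, rng)
    genes = sorted(rng.sample(range(d), q))
    return [[row[g] for g in genes] for row in data]


def binary_association_matrix(ensemble, k):
    """BM with n rows and m*k columns; BM[i][c] = 1 iff sample i is in cluster c.

    `ensemble` is a list of m label vectors with labels in 0..k-1. Cluster j of
    solution s is mapped to column s * k + j (a layout assumed for this sketch).
    """
    n, m = len(ensemble[0]), len(ensemble)
    bm = [[0] * (m * k) for _ in range(n)]
    for s, labels in enumerate(ensemble):
        for i, label in enumerate(labels):
            bm[i][s * k + label] = 1
    return bm


rng = random.Random(0)
q = subspace_size(1000, rng)
data = [[float(i + j) for j in range(20)] for i in range(4)]  # toy 4 x 20 dataset
sub = random_subspace(data, rng)
ensemble = [[0, 0, 1, 1], [1, 1, 0, 1]]  # m = 2 toy solutions for n = 4 samples
BM = binary_association_matrix(ensemble, k=2)
```

Each row of BM contains exactly m ones, one per clustering solution; the refinement step then replaces some of the zeros with estimated association values.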



Algorithm 2: Semi-supervised consensus clustering (SSCC)

Input: Given a gene expression $n \times d$ dataset with n samples and d genes.
Set the number of clusters k, the number of pairwise constraints l, the ensemble size m,
and the number of folds h in cross-validation.

Output: Group $x_1, \ldots, x_n$ into k clusters.

1. In each run, split the data into h folds. For each fold, run steps 2-5.

2. Generate l pairwise must-link constraints from the data points of the other h - 1 folds.

3. Generate a cluster ensemble $\Pi = \{\pi_1, \ldots, \pi_m\}$ with m clustering solutions,
$\pi_i = \{C_1^i, \ldots, C_k^i\}$.

(a) Generate m subspace datasets, one for each $i = 1, \ldots, m$.

(b) Apply Algorithm 1 (SSC) steps 2-10 on each subspace dataset with the fixed number
of clusters k to get $\pi_i$.

(c) Store $\pi_i$ in the cluster ensemble $\Pi$.

4. Generate a cluster-association matrix RM from $\Pi$.

5. Apply spectral clustering on RM and cluster the dataset into k clusters.
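The cross-validation loop of steps 1-2 can be sketched in Python as follows. This is a minimal illustration, assuming a simple shuffled split; the function names are hypothetical, and the clustering steps themselves are only indicated in a comment.

```python
import random


def h_fold_indices(n, h, rng):
    """Shuffle 0..n-1 and deal the indices into h folds of near-equal size."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[f::h] for f in range(h)]


def cross_validation_splits(n, h, rng):
    """Yield (train, test): train is the union of h-1 folds, test the held-out fold."""
    folds = h_fold_indices(n, h, rng)
    for f in range(h):
        test = sorted(folds[f])
        train = sorted(i for g in range(h) if g != f for i in folds[g])
        yield train, test


splits = list(cross_validation_splits(n=42, h=5, rng=random.Random(0)))
fold_sizes = [(len(train), len(test)) for train, test in splits]
# As we read the description: must-links are drawn from `train` only (step 2),
# steps 3-5 produce the clustering, and the scores are computed on `test` alone.
```

Keeping the constrained samples out of the evaluation set is what allows the effect of prior knowledge to be measured on samples that carried no constraints.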



Results 

Selected algorithms 

We compared the performance of four algorithms: SSCC, SSC [19], LCE [4], and k-means
(Table 1). The performance of SSCC was influenced by the amount of prior knowledge, the
consensus function, and the base clustering. By increasing the amount of prior knowledge,
we observed the influence of prior knowledge on SSCC. SSCC uses SSC as the base
clustering. By comparing SSCC with SSC on the same amount of prior knowledge, we were
able to observe the influence of consensus clustering on SSCC. Like LCE, SSCC uses
HBGF as the consensus function. SSCC became a consensus clustering algorithm when
it did not use prior knowledge. k-means was used as the baseline algorithm in this study.
In both SSCC and LCE, we used subspace datasets and a fixed number of clusters, an ensemble
size of 10, and a nearest neighbor size of 5. We implemented SSCC in Matlab and adopted the
Matlab code of SSC [20], LCE [4] and k-means.



Table 1 Attributes of four clustering algorithms

Clustering algorithm | Type                                 | Base clustering | Final clustering | Consensus function | Using prior knowledge
k-means              | Simple clustering                    | k-means         | -                | -                  | No
LCE                  | Consensus clustering                 | k-means         | SC               | HBGF               | No
SSC                  | Semi-supervised clustering           | SC              | -                | -                  | Yes
SSCC                 | Semi-supervised consensus clustering | SSC             | SC               | HBGF               | Yes



Wang and Pan BioData Mining 2014, 7:7
http://www.biodatamining.org/content/7/1/7



Datasets 

All four algorithms were tested with eight cancer gene expression datasets (Table 2).
These processed datasets, from which the non-informative genes had been removed, were
obtained from [21]. Prior knowledge was represented as pairwise constraints generated from
the sample class labels of the eight datasets. A pair of samples sharing the same class was
given a must-link constraint. We used a small amount of prior knowledge to test the
effectiveness of SSCC (Table 2).
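Generating must-link constraints from class labels can be sketched as follows. The helper names are hypothetical, and the denominator used in `constraint_fraction` (all n(n-1)/2 sample pairs) is an assumption about how a "percentage of constraints" might be computed, not a statement of how Table 2 was derived.

```python
import itertools
import random


def all_must_links(labels, indices=None):
    """Every pair (i, j), i < j, of samples that share a class label."""
    if indices is None:
        indices = range(len(labels))
    return [(i, j) for i, j in itertools.combinations(sorted(indices), 2)
            if labels[i] == labels[j]]


def sample_must_links(labels, n_constraints, rng, indices=None):
    """Draw up to n_constraints must-link pairs without replacement."""
    candidates = all_must_links(labels, indices)
    return rng.sample(candidates, min(n_constraints, len(candidates)))


def constraint_fraction(n_constraints, n_samples):
    """Constraints as a fraction of all n(n-1)/2 sample pairs (assumed denominator)."""
    return n_constraints / (n_samples * (n_samples - 1) / 2)


labels = ["A", "A", "B", "B", "A"]  # toy class labels
links = sample_must_links(labels, 2, random.Random(1))
fraction = constraint_fraction(100, 203)
```

Passing the training-fold indices as `indices` restricts the constraints to the h - 1 training folds, in line with the cross-validation design described in the Method section.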

Performance measures 

The performance was measured with normalized mutual information (NMI) [29] and
adjusted Rand index (ARI) [30]. ARI is often used to assess the performance of clustering
samples in gene expression datasets [1-4]. NMI is defined as follows.
Let X and Y be the random variables described by the cluster assignments and class labels,
respectively. $I(X, Y)$ denotes the mutual information between X and Y, and $H(X)$ and
$H(Y)$ denote the entropies of X and Y. NMI is defined by

$$NMI(X, Y) = \frac{I(X, Y)}{\sqrt{H(X) H(Y)}} \qquad (2)$$
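For concreteness, NMI as defined above can be computed from two label vectors with the following minimal Python sketch. The function names are ours, and returning 0 when either partition has zero entropy is a convention assumed here, since the definition is otherwise undefined in that case.

```python
import math
from collections import Counter


def entropy(labels):
    """H(X) = -sum_x p(x) log p(x), estimated from label frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """I(X, Y) = sum_{a,b} p(a, b) log(p(a, b) / (p(a) p(b)))."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def nmi(x, y):
    """NMI(X, Y) = I(X, Y) / sqrt(H(X) H(Y))."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0.0 or hy == 0.0:
        return 0.0  # convention assumed here for a single-cluster partition
    return mutual_information(x, y) / math.sqrt(hx * hy)


same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

NMI does not depend on how the clusters are numbered, so a clustering that matches the class labels up to renaming scores 1, while a clustering independent of the labels scores 0.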

Experimental results 

The experiments were performed with an increasing number of pairwise constraints, using
5-fold cross-validation and 50 runs (Figures 1, 2).

Without prior knowledge, comparisons of SSCC, SSC, LCE and k-means were performed
using one-way ANOVA with Bonferroni correction (p < 0.05) on NMI and
ARI (Table 3 and Additional file 1). We used a paired t-test (p < 0.05) to compare SSCC
and SSC with prior knowledge on NMI and ARI, respectively. The null hypothesis was
that no difference existed between the means of SSCC and SSC. We used 20 pairwise
constraints for CNS, Leukemia1, Leukemia2 and Leukemia3, and 100 constraints for the other
4 datasets (Table 4).

Our results demonstrated that consensus clustering and the use of prior knowledge
both contribute to improving the quality of clustering, and that an integration of both
performed even better (Figures 1, 2 and Tables 3, 4). Without injection of prior knowledge,
the performance of SSCC and SSC was more or less equivalent, but both were significantly
better than LCE and k-means (Table 3). On the other hand, with injection of prior
knowledge, SSCC significantly outperformed SSC (Table 4).



Table 2 Cancer gene expression datasets used in experiments

Dataset            | Samples | Original probes | Selected probes | Classes | Constraints number | Constraints % in total
CNS [22]           | 42      | 7129            | 1379            | 5       | 20                 | 2.2%
Leukemia1 [23]     | 72      | 7129            | 1877            | 2       | 20                 | 0.77%
Leukemia2 [23]     | 72      | 7129            | 1877            | 3       | 20                 | 0.77%
Leukemia3 [24]     | 72      | 12582           | 2194            | 3       | 20                 | 0.77%
LungCancer [25]    | 203     | 12600           | 1543            | 5       | 100                | 0.48%
StJude [26]        | 248     | 12625           | 2526            | 6       | 100                | 0.32%
Multi-Tissue1 [27] | 174     | 12533           | 1571            | 10      | 100                | 0.66%
Multi-Tissue2 [28] | 190     | 16063           | 1363            | 14      | 100                | 0.55%






Figure 1 Normalized mutual information with various numbers of constraints on (A) CNS
(B) Leukemia1 (C) Leukemia2 (D) Leukemia3 (E) LungCancer (F) St. Jude (G) Multi-Tissue1
(H) Multi-Tissue2 datasets (error bars show 95% confidence interval). x-axis: the number of
constraints; legend entries recovered: SSC, SSCC.



Parameter analysis 

Ensemble size was one of the important parameters that influence SSCC and LCE (Figure 3).
SSCC significantly outperformed LCE in all ensemble size settings across the 8 datasets,
except for sizes 40 and 50 on Leukemia3. In some datasets, the performance of SSCC or
LCE improved when the ensemble size increased from 10 to 20. However, there was no
significant improvement in other datasets such as Multi-Tissue1 and Multi-Tissue2. In
such cases we suggest a small ensemble size, such as 10.

The influence of ensemble type appeared to be more obvious (Figure 4). We compared the
performance of two ensemble types, "Fixed k + Subspace" and "Random k + Full-space",
on SSCC and LCE. SSCC outperformed LCE with both ensemble types in the majority of the
8 datasets. SSCC with "Fixed k + Subspace" appeared to be generally better than the other
combinations.




Figure 2 Adjusted Rand index with various numbers of constraints on (A) CNS (B) Leukemia1
(C) Leukemia2 (D) Leukemia3 (E) LungCancer (F) St. Jude (G) Multi-Tissue1 (H) Multi-Tissue2
datasets (error bars show 95% confidence interval). x-axis: the number of constraints;
legend: SSC, SSCC, LCE, k-means.

Table 3 Without prior knowledge, comparison among SSCC, SSC, LCE, and k-means

First algorithm | NMI vs. SSC | NMI vs. LCE | NMI vs. k-means | ARI vs. SSC | ARI vs. LCE | ARI vs. k-means
SSCC            | 4/4/0       | 7/1/0       | 8/0/0           | 4/3/1       | 7/1/0       | 8/0/0
SSC/SC          | -           | 6/2/0       | 8/0/0           | -           | 6/2/0       | 6/2/0
LCE             | -           | -           | 6/2/0           | -           | -           | 5/3/0

All results are summarized as w/t/l, i.e. the first algorithm wins w times, ties t times and loses l times.



The performance of both SSCC and SSC was significantly influenced by neighborhood size
(Figure 5). Without applying prior knowledge, we conducted a paired two-tailed t-test
(p < 0.05) between SSCC and SSC under four different t values. In the majority of the
datasets, both algorithms performed better with a smaller neighborhood size. Generally,
SSCC outperformed SSC.

Discussion 

We compared the performance of SSCC with SSC, LCE and k-means, and each of our
pairwise comparisons provides information on the effect of either semi-supervision or
consensus clustering. Specifically, comparing LCE with k-means reveals the effectiveness of
the ensemble strategy, since k-means is used as the base clustering in LCE. Similarly, in
comparing SSC with SSCC, we used the same amount of prior knowledge, so we effectively
compared spectral clustering with consensus clustering. The comparison between SSCC
and LCE reveals the effect of semi-supervision under the consensus clustering paradigm.

SSCC significantly outperforms SSC with or without prior knowledge. This clearly
shows that consensus clustering algorithms outperform single clustering algorithms on
the gene expression datasets. This observation is consistent with [1-4].

We compared SSCC with LCE using the same datasets and the same parameter settings.
Without considering prior knowledge, the difference between SSCC and LCE is in the base
clustering: SSCC uses spectral clustering, whereas LCE uses k-means. They both use
spectral clustering for final clustering (Table 1). Without prior knowledge, SSC becomes SC,
and SC outperforms k-means in all 8 datasets (Figures 1, 2 and Table 3). This indicates



Table 4 With prior knowledge, paired t-test for the mean difference between SSCC and SSC

Dataset               | NMI    | ARI
CNS                   | 0.041* | 0.097*
Leukemia1             | 0.056* | 0.053*
Leukemia2             | 0.094* | 0.143*
Leukemia3             | 0.024* | 0.031*
LungCancer            | 0.018* | -0.037*
StJude                | 0.009* | 0.0144*
MultiTissue1          | 0.002  | 0.007
MultiTissue2          | 0.012* | 0.035*
SSCC vs. SSC (w/t/l)  | 7/1/0  | 6/1/1

*The mean difference (SSCC - SSC) is significant at the p < 0.05 level. The results are summarized as w/t/l, i.e. the first algorithm
wins w times, ties t times and loses l times.
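The w/t/l summary can be recomputed from the per-dataset entries of Table 4 with the short Python sketch below. It assumes that a significant positive difference counts as a win, a significant negative difference as a loss, and a non-significant difference as a tie, and it reads every marked entry in both columns as significant; the function name is ours.

```python
def win_tie_loss(mean_differences):
    """Tally (wins, ties, losses) from (difference, is_significant) pairs."""
    wins = ties = losses = 0
    for diff, significant in mean_differences:
        if not significant:
            ties += 1
        elif diff > 0:
            wins += 1
        else:
            losses += 1
    return wins, ties, losses


# (SSCC - SSC mean difference, significant at p < 0.05), in the row order of Table 4.
nmi_diffs = [(0.041, True), (0.056, True), (0.094, True), (0.024, True),
             (0.018, True), (0.009, True), (0.002, False), (0.012, True)]
ari_diffs = [(0.097, True), (0.053, True), (0.143, True), (0.031, True),
             (-0.037, True), (0.0144, True), (0.007, False), (0.035, True)]

nmi_summary = win_tie_loss(nmi_diffs)
ari_summary = win_tie_loss(ari_diffs)
```

Under these assumptions the tallies agree with the summary row of the table: the single tie is MultiTissue1, and the single ARI loss is LungCancer.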
Figure 3 Normalized mutual information of SSCC and LCE with the change of ensemble size on eight
datasets.



the performance of base clustering has a significant influence on the results of consensus
clustering.

SSCC consists of spectral clustering and LCE. The majority of the computational time of
spectral clustering is spent on finding the t nearest neighbors [20]. The time complexity of
obtaining the t-nearest neighbor sparse matrix is $O(n^2 d) + O(n^2 \log t)$, where n is the
number of samples and d is the number of genes in the graph of spectral clustering. We use
the fixed number of clusters k in LCE; the time complexity of generating a cluster-association
matrix RM is $O(m^2 k^2 + nmk) + O(m^2 k^2 t' + nmk)$, where m is the ensemble size and t' is
the average number of neighbors connecting to one cluster in a network of clusters in final
clustering. In SSCC, the complexity of generating l pairwise constraints is $O(l)$. The
overall time complexity of SSCC using the "Fixed k + Subspace" ensemble type is

$$O(l) + O(mn^2 d) + O(mn^2 \log t) + O(m^2 k^2 + nmk) + O(m^2 k^2 t' + nmk)$$



Figure 4 Normalized mutual information of SSCC and LCE with two ensemble types on eight datasets.
Legend: SSCC and LCE, each with "Random k + Full-space" and "Fixed k + Subspace".

Since n > m, n > k, d > n, d > l, and d > t in our experiments, the bottleneck of SSCC
is finding the t nearest neighbors, with computational time $O(mn^2 d)$. The implementation
of spectral clustering is a parallel algorithm [20], so the majority of the computational
time of SSCC can be reduced to $O(mn^2 d / p')$, where p' is the number of parallel threads.
The applicability of SSCC to large datasets is limited by the computational complexity of
spectral clustering. SSCC can be improved by adopting faster spectral clustering algorithms,
which are applicable to datasets with thousands of instances.
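To see why the nearest neighbor search dominates, the terms of the overall complexity expression, read as O(l) + O(mn^2 d) + O(mn^2 log t) + O(m^2 k^2 + nmk) + O(m^2 k^2 t' + nmk), can be evaluated for one dataset. The Python sketch below is only a back-of-the-envelope illustration: it ignores constant factors, uses d as an upper bound on the subspace size, and uses a made-up value for t', which is not reported in this paper.

```python
import math


def sscc_cost_terms(n, d, m, k, t, l, t_prime):
    """Operation-count proxies for each term of the overall complexity."""
    return {
        "constraints": l,
        "knn_distances": m * n ** 2 * d,
        "knn_selection": m * n ** 2 * math.log(t),
        "association_matrix": m ** 2 * k ** 2 + n * m * k,
        "refinement": m ** 2 * k ** 2 * t_prime + n * m * k,
    }


# StJude row of Table 2 (n = 248, 2526 selected probes, 6 classes, 100 constraints)
# with ensemble size 10 and neighborhood size 5; t_prime = 6 is a placeholder.
terms = sscc_cost_terms(n=248, d=2526, m=10, k=6, t=5, l=100, t_prime=6)
dominant = max(terms, key=terms.get)
```

For these values the pairwise distance term is several orders of magnitude larger than every other term, which is consistent with the bottleneck identified above.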

Our study provided an insight into the contribution of consensus clustering and semi- 
supervised clustering to the clustering results. To our knowledge, the Knowledge based 
Cluster Ensemble (KCE) [14] is the only algorithm using prior knowledge in consensus 
clustering paradigm for gene expression datasets. Unfortunately, we are unable to directly 
compare SSCC with KCE because of the unavailability of the software. 

Our study uses SSCC for clustering samples. Since the optimal number of clusters (k
in the k-means algorithm) and the class label of each sample are known, the prior
knowledge is derived from the given class structure. A must-link constraint is given to a
pair of samples if they are from the same class. For many real applications, we might not
know the whole class structure, but most likely we know whether some of the samples are
in the same class (cluster). We can generate must-links between these samples, and prior
knowledge is derived from these samples. In these cancer gene expression datasets, we
validated the performance of SSCC with the labeled data. The next step would be to apply
SSCC to clustering genes for gene function prediction. However, the performance on
clustering genes might vary for two reasons: the quality of prior knowledge and the
optimal number of clusters. Pairwise constraints in this study were generated from the
class labels of samples in the cancer gene expression datasets, and they are true prior
knowledge. Prior knowledge in clustering of genes will be known gene functions, which
are partial domain knowledge. A gene may have multiple functions; some functions
are inclusive of others as well. For example, the level 6 gene ontology term apoptotic
process (GO:0006915) has over ten thousand gene products, and under it, at
level 7, there are 21 GO terms. Our earlier work shows that more specific (higher level)
GO terms contribute better to semi-supervised clustering results [13]. Also, the
description of a certain gene function is based on current knowledge in the domain field.
Such domain knowledge is often subject to change. For example, current knowledge of
certain existing genes is limited and will gradually be enriched. Therefore, the prior
knowledge generated from a pair of genes most likely contains some noise, which
subsequently influences the results. The optimal number of clusters is often unknown, and a
different distance measure would generate a different optimal number of clusters. Therefore,
for comparison of semi-supervised clustering algorithms, it is better to use defined
prior knowledge, such as the sample labels we used in this paper. When an algorithm
is considered to be superior to the others, it can then be used to cluster
genes.

In reality, obtaining a large amount of prior knowledge for gene expression datasets is
difficult. Designing algorithms that work best with a small amount of prior knowledge,
such as fewer than 20 pairwise constraints, will be very useful for clustering microarray
data. A study on semi-supervised clustering shows that with small amounts of prior
knowledge, the search-based approach tends to outperform the similarity-based approach [31].
With larger amounts of labeled data, the similarity-based approach tends to perform better.
Combining both approaches outperforms the respective individual approaches. SSC is a
similarity-based semi-supervised clustering algorithm. The results in Figures 1, 2 show
that the performance of SSCC and SSC is slightly improved with small numbers of constraints
and significantly improved with increasing numbers of constraints. Our SSCC method
presented in this paper is applicable not only to gene expression data, but also to other
types of data, as long as prior knowledge is provided.

Conclusions 

In this study, we proposed a new semi-supervised consensus clustering method, designed
an algorithm, and compared it with another semi-supervised clustering algorithm, a
consensus clustering algorithm and a simple clustering algorithm on eight real cancer
gene expression datasets. In general, using prior knowledge improves the performance
of clustering in gene expression datasets. Consensus clustering is able to reach the
goal of maximizing intra-cluster similarity and minimizing inter-cluster similarity. Also,
using prior knowledge enhances the consistency between data partitioning and
domain knowledge. A combination of both significantly improves the quality of
clustering. SSCC outperforms the semi-supervised clustering algorithm SSC and the consensus
clustering algorithm LCE in most datasets over various parameter settings, ensemble
sizes and types, with or without prior knowledge. This study demonstrates that
SSCC is an effective and robust semi-supervised consensus clustering algorithm with
prior knowledge, and also a superior consensus clustering algorithm without prior
knowledge.